https://github.com/rjmaitri/Midterm.git
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(plotly)
library(profileModel)
library(brms)
library(bayesplot)
library(reactable)
library(tidybayes)
library(rstanarm)
library(modelr)
library(loo)
options(mc.cores = parallel::detectCores())
High-throughput sequencing technology provides data on transcript abundance and diversity, allowing us to interrogate the transcriptomic effects of chemotherapeutic drug treatments on breast cancer cell (BCC) lines. By culturing a BCC line with different combinations of chemotherapeutic agents, we can characterize the transcriptional changes that occur under chemotherapy.
The negative binomial distribution is suited for this data because it is count data with a variance that shifts away from the mean at higher levels of gene expression. The variance behaves this way because experiments typically have 3-5 samples with a few highly expressed genes and many lowly expressed genes. Theoretically, a Poisson distribution would work with a larger sample size; however, that is financially burdensome.
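A quick simulated sketch (toy counts, not RNA-seq data) of why the negative binomial accommodates this: a Poisson variable's variance equals its mean, while a negative binomial variable's variance is mean + mean^2/size, pulling away from the mean as dispersion grows.

```r
# Poisson counts: variance == mean. Negative binomial counts:
# variance = mu + mu^2/size, so variance pulls away from the mean.
set.seed(1)
mu <- 50
pois_counts <- rpois(1e5, lambda = mu)
nb_counts   <- rnbinom(1e5, mu = mu, size = 5)  # size controls overdispersion
c(mean(pois_counts), var(pois_counts))  # both near 50
c(mean(nb_counts), var(nb_counts))      # mean near 50, variance near 550
```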
Random sampling from breast cancer tissue throughout the population would be ideal; however, two-thirds of all in vitro breast cancer research uses three BCC lines, the oldest of which has been cultured since 1970 and retains many features of the tumor it originated from. Biological triplicates and experimental replicates enhance the validity of estimation, while improper experimental design and execution create batch effects. Running experiments asynchronously, using different reagents, and applying inconsistent cell culture techniques can create batch and confounding effects on the results of the experiment. These need to be minimized in order to lend credibility to the recorded effects of the chemotherapeutic agents.
Further, the genomic instability of cancer creates the possibility for confounding effects, as the differential gene expression seen across treatments may be related to differences within the sample itself.
2a) Access (5 points) Download and read in the data. Can you do this without downloading, but read directly from the archive (+1).
#read raw github from internet
Covid_JHU <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
#look at the file
head(Covid_JHU)
## # A tibble: 6 x 317
## UID iso2 iso3 code3 FIPS Admin2 Province_State
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 8.40e7 US USA 840 1001 Autau~ Alabama
## 2 8.40e7 US USA 840 1003 Baldw~ Alabama
## 3 8.40e7 US USA 840 1005 Barbo~ Alabama
## 4 8.40e7 US USA 840 1007 Bibb Alabama
## 5 8.40e7 US USA 840 1009 Blount Alabama
## 6 8.40e7 US USA 840 1011 Bullo~ Alabama
## # ... with 310 more variables: Country_Region <chr>,
## # Lat <dbl>, Long_ <dbl>, Combined_Key <chr>,
## # `1/22/20` <dbl>, `1/23/20` <dbl>, `1/24/20` <dbl>, ...
## # (one cumulative-case column per day thereafter)
2b) It’s big and wide! (10 Points) The data is, well, huge. It’s also wide, with dates as columns. Write a function that, given a state, will output a time series (long data where every row is a day) of cumulative cases in that state as well as new daily cases.
#function returning a long time series of cumulative and new daily cases for one state
state_function <- function(x) {
  Covid_JHU %>%
    #keep the state column and the daily date columns (last_col() keeps up with daily updates)
    select(state = 7, 12:last_col()) %>%
    #pivot long: one row per county per day
    pivot_longer(cols = !state,
                 names_to = "Date",
                 values_to = "Cases") %>%
    #parse the dates with lubridate
    mutate(Date = mdy(Date)) %>%
    #sum county counts up to the state level
    group_by(Date, state) %>%
    summarise(Cases = sum(Cases), .groups = "drop") %>%
    #filter by function input and add new daily cases
    filter(state == x) %>%
    arrange(Date) %>%
    mutate(New_Cases = Cases - lag(Cases, default = 0))
}
Mass_data <- state_function("Massachusetts")
reactable(Mass_data)
2c) Let’s get visual! (10 Points) Great! Make a compelling plot of the timeseries for Massachusetts! Points for style, class, ease of understanding major trends, etc. Note, 10/10 only for the most killer figures. Don’t phone it in! Also, note what the data from JHU is. Do you want the cumulatives, or daily, or what?
p <- Mass_data %>%
  ggplot(aes(x = Date, y = Cases)) +
  geom_area(fill = "#69b3a2", alpha = 0.5) +
  geom_line(color = "#69b3a2") +
  ylab("Cumulative Covid-19 Cases") +
  theme_dark()
# Turn it interactive with ggplotly
p <- ggplotly(p)
p
2d) At our fingertips (10 Points) Cool. Now, write a function that will take what you did above, and create a plot for any state - so, I enter Alaska and I get the plot for Alaska! +2 if it can do daily or cumulative cases - or cases per 100,000 if you did that above. +3 EC if you highlight points of interest - but dynamically using the data. Note, you might need to do some funky stuff to make things fit well in the plot for this one. Or, meh.
#modularize state data and plot
state_plot <- function(x){
#state function
data <- state_function(x)
#create a ggplot object with state data
  p <- data %>%
    ggplot(aes(x = Date, y = Cases)) +
    geom_area(fill = "#69b3a2", alpha = 0.5) +
    geom_line(color = "#69b3a2") +
    ylab("Cumulative Covid-19 Cases") +
    theme_dark()
# Turn it interactive with ggplotly
out <- ggplotly(p)
return(out)
}
state_plot("Alabama")
What do you feel is the inferential framework that you adopt as a scientist? Why? Include in your answer why you prefer the inferential tools (e.g. confidence intervals, test statistics, out-of-sample prediction, posterior probabilities, etc.) of your chosen worldview and why you do not like the ones of the other one. This includes defining just what those different tools mean, as well as relating them to the things you study. extra credit for citing and discussing outside sources - one point per source/point
I am inclined towards inferential thought processes that resemble Bayesian probabilistic thinking, as I tend to make causal inferences by observing patterns. Further, this inferential thought process is influenced by my prior knowledge, which is analogous to a conditional probability. My mind is, however, often misled by conditional probabilities if the math is not explicitly shown. For example, I don’t perform the mental math necessary to determine the posterior probability of having a disease given a positive test result. The statistics required to formulate this posterior can be easily misinterpreted if taken piecemeal. A low false-positive rate and a relatively high test specificity give the impression that P(have the disease | positive test result) is high; however, the base rate of the disease needs to be taken into account. After factoring in the base rate, the probability of not having the disease can overwhelm the posterior and contradict my initial appraisal.
As an undergraduate scientist, frequentist Null Hypothesis Significance Testing has been the primary inferential framework used within my studies. For instance, hypotheses in Genetics (BIOL254) are tested by calculating probabilities for allele inheritance: expected and observed frequencies are compared via the chi-square distribution with the appropriate degrees of freedom. The probability value generated from the chi-square distribution is then compared against a threshold for significance; p-values below the threshold are considered statistically significant and lead to rejection of the null hypothesis.
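That workflow can be sketched in a few lines; the F2 counts below are hypothetical, not from BIOL254.

```r
# Chi-square goodness-of-fit test of hypothetical F2 counts against a
# Mendelian 3:1 dominant:recessive expectation
observed <- c(dominant = 290, recessive = 110)
mendel_test <- chisq.test(observed, p = c(3, 1) / 4)
mendel_test$p.value  # ~0.25 here, above 0.05, so we fail to reject the 3:1 ratio
```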
#Calculate the Posterior
#Likelihood ~ P(Yes|Explodes): the machine answers yes unless both dice come up six ####
Likelihood = 35/36
#Prior ~ p(Explodes) ####
Prior = .0001
#calculate the denominator - law of total probability for the sun exploding ####
#P(Yes) = P(Yes|Explodes)P(Explodes) + P(Yes|Doesn't explode)P(Doesn't explode)
Marginal_Likelihood = (35/36)*.0001 + (1/36)*(1-.0001)
############### CALCULATE THE POSTERIOR
#P(Explodes|Yes) = P(Yes|Explodes)P(Explodes)/P(Yes)
Posterior = (Likelihood*Prior)/Marginal_Likelihood
The probability that the sun exploded given that the machine said yes is approximately 3.49 × 10^-3: still small, because the tiny prior probability of the sun exploding overwhelms the machine’s low false-positive rate.
4a Extra Credit (10 Points) Why is this a bad parody of frequentist statistics?
The stick-figure scientist did not consider the prior probability of the sun exploding. The p-value of 1/36 is specific to the false-positive rate, which is independent of the probability of the sun exploding. Therefore, this is a bad parody because the null hypothesis that the sun did not explode is rejected using only the probability of rolling two sixes. While it is unlikely that the machine lied, it is far more unlikely that the sun went nova.
#load quail morphology data (NAs are handled downstream, where each fit requires it)
quail_data <- read.csv("data/Morphology data.csv")
#structure of data frame
str(quail_data)
## 'data.frame': 880 obs. of 10 variables:
## $ Bird.. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sex : chr "" "m" "f" "m" ...
## $ Age..days. : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Exp..Temp...degree.C.: int 15 15 15 15 15 15 15 15 15 15 ...
## $ Mass..g. : num 16.1 19.2 17.5 14.4 17.4 ...
## $ Tarsus..mm. : num 19.4 20.4 19 20.1 21.8 ...
## $ Culmen..mm. : num 7.64 7.49 7.31 7.34 8.24 6.82 7.84 7.39 7.4 7.81 ...
## $ Depth..mm. : num 4.23 4.46 3.92 3.85 4.42 3.65 3.94 3.72 4.5 4.09 ...
## $ Width..mm. : num 4.49 4.44 4.01 4.22 4.56 3.73 4.6 4.66 3.83 3.89 ...
## $ NOTES : chr "" "" "" "" ...
skimr::skim(quail_data)
| Name | quail_data |
| Number of rows | 880 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 0 | 1 | 88 | 3 | 0 |
| NOTES | 0 | 1 | 0 | 35 | 872 | 8 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Bird.. | 0 | 1.00 | 20.50 | 11.55 | 1.00 | 10.75 | 20.50 | 30.25 | 40.00 | ▇▇▇▇▇ |
| Age..days. | 0 | 1.00 | 37.00 | 31.15 | 5.00 | 15.00 | 26.00 | 49.00 | 123.00 | ▇▃▁▁▁ |
| Exp..Temp...degree.C. | 0 | 1.00 | 22.50 | 7.50 | 15.00 | 15.00 | 22.50 | 30.00 | 30.00 | ▇▁▁▁▇ |
| Mass..g. | 76 | 0.91 | 138.75 | 93.09 | 12.51 | 50.07 | 127.58 | 219.59 | 375.00 | ▇▅▅▃▁ |
| Tarsus..mm. | 112 | 0.87 | 31.73 | 6.74 | 16.19 | 26.19 | 33.81 | 37.17 | 49.42 | ▃▆▇▇▁ |
| Culmen..mm. | 114 | 0.87 | 11.73 | 2.80 | 4.23 | 9.31 | 11.98 | 14.03 | 18.14 | ▁▆▆▇▂ |
| Depth..mm. | 112 | 0.87 | 5.59 | 0.80 | 3.30 | 5.06 | 5.60 | 6.11 | 9.52 | ▂▇▆▁▁ |
| Width..mm. | 112 | 0.87 | 4.82 | 0.53 | 3.29 | 4.45 | 4.84 | 5.17 | 6.92 | ▁▆▇▂▁ |
5a) Three fits (10 points) To begin with, I’d like you to fit the relationship that describes how Tarsus (leg) length predicts upper beak (Culmen) length. Fit this relationship using least squares, likelihood, and Bayesian techniques. For each fit, demonstrate that the necessary assumptions have been met. Note, functions used to fit with likelihood and Bayes may or may not behave well when fed NAs. So look out for those errors.
#generate a plot of the quail data
quail_plot <- ggplot(data = quail_data,
mapping = aes(x = Tarsus..mm., y = Culmen..mm.)) +
geom_point(alpha = 0.5)+
theme_classic()
#plot with line
quail_plot +
stat_smooth(method=lm, formula=y~x)
The variance of the residuals appears to violate homoscedasticity, as evidenced by the increase in spread as the mean increases. This indicates a GLM may be better suited, as it can reweight the larger variances.
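A simulated sketch of the pattern described above (hypothetical data, not the quail measurements): when noise grows with the mean, residual spread from an ordinary lm fit is visibly larger at high fitted values.

```r
set.seed(42)
x <- runif(200, 16, 50)                    # predictor on a tarsus-like scale
y <- 0.37 * x + rnorm(200, sd = 0.1 * x)   # noise sd grows with the mean
het_fit <- lm(y ~ x)
# residual spread in the lower vs. upper half of the fitted values
lower_sd <- sd(residuals(het_fit)[fitted(het_fit) <  median(fitted(het_fit))])
upper_sd <- sd(residuals(het_fit)[fitted(het_fit) >= median(fitted(het_fit))])
upper_sd / lower_sd  # noticeably greater than 1 under heteroscedasticity
```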
#fit the relationship of tarsus length predicting upper beak using least squares
LSquail <- lm(Culmen..mm. ~ Tarsus..mm., data = quail_data)
#simulate the data
LSquail_sims <- simulate(LSquail, nsim = 20) %>%
pivot_longer(
cols = everything(),
names_to = "sim",
values_to = "Culmen..mm."
)
#plot distribution of simulated data over observed data
ggplot() +
geom_density(data = LSquail_sims,
mapping = aes(x = Culmen..mm., group = sim),
size = 0.2) +
geom_density(data = quail_data,
mapping = aes(x = Culmen..mm.),
size = 2, color = "blue")
The simulated data does not recapitulate our observed values.
#check the relationship of residual vs. fitted values
plot(LSquail, which =1)
The residuals vs. fitted plot clusters in two regions rather than forming a single homogeneous cloud. This may be suggestive of non-linearity and the need to restructure the model.
#check normality of residuals
residuals(LSquail) %>% hist()
Residuals are normally distributed.
plot(LSquail, which = 2)
The linearity of this qqplot is suspect.
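A Shapiro-Wilk test offers a numeric complement to eyeballing the histogram and qqplot; the sketch below uses R’s built-in cars data rather than the quail fit.

```r
# Shapiro-Wilk normality test on residuals from a toy lm fit
toy_fit <- lm(dist ~ speed, data = cars)
sw <- shapiro.test(residuals(toy_fit))
sw  # W near 1 and a large p-value would support normality of the residuals
```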
Likelihood Regression
#Plot GLM for quail data
ggplot(quail_data,
mapping = aes(x = Tarsus..mm., y = Culmen..mm.)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = gaussian(link="identity")))
#fit a Likelihood model using the iteratively reweighted least squares algorithm in GLM
quail_mle <- glm(Culmen..mm. ~ Tarsus..mm.,
#identity means 1:1 translation between linear predictors and shape of curve
family = gaussian(link = "identity"),
data = quail_data)
#extract predicted and residuals for plot
fitted <- predict(quail_mle)
res <- residuals(quail_mle)
qplot(fitted, res)
qqnorm(res)
The linearity of the GLM qqplot is better behaved than the LM plot
hist(res)
Residuals have a normal distribution.
Bayes Regression
options(mc.cores = parallel::detectCores())
set.seed(600)
color_scheme_set("orange")
quail_brm <- brm(Culmen..mm. ~ Tarsus..mm.,
data = quail_data,
family = gaussian(link = "identity"))
plot(quail_brm)
The parameters are well behaved and the chains show convergence.
#does our data match the chains for y-distributions?
color_scheme_set("green")
pp_check(quail_brm, "dens_overlay")+
theme_light()
The data contains trends that are not well-reflected in the predicted values.
#fitted vs. residual, do we meet linearity assumption?
quail_fit <- fitted(quail_brm) %>% as.data.frame()
quail_res <- residuals(quail_brm) %>% as.data.frame()
plot(quail_fit$Estimate, quail_res$Estimate)
The residuals vs fitted appears to cluster at two points, suggesting that this model violates the assumption of linearity.
qqnorm(quail_res$Estimate)
qqline(quail_res$Estimate)
5b) Three interpretations (10 points) OK, now that we have fits, take a look! Do the coefficients and their associated measures of error in their estimation match? How would we interpret the results from these different analyses differently? Or would we? Note, confint works on lm objects as well.
#LM coef ####
LSquail <- lm(Culmen..mm. ~ Tarsus..mm., data = quail_data)
coef(LSquail)
## (Intercept) Tarsus..mm.
## -0.09870712 0.37292677
#Least squares confidence intervals
confint(LSquail)
## 2.5 % 97.5 %
## (Intercept) -0.5216505 0.3242363
## Tarsus..mm. 0.3598809 0.3859727
#Likelihood ~ GLM ####
coef(quail_mle)
## (Intercept) Tarsus..mm.
## -0.09870712 0.37292677
confint(quail_mle)
## 2.5 % 97.5 %
## (Intercept) -0.5209805 0.3235663
## Tarsus..mm. 0.3599015 0.3859520
#Bayes Regression ####
fixef(quail_brm)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -0.1044357 0.215493094 -0.5263282 0.3218312
## Tarsus..mm. 0.3731036 0.006681195 0.3598419 0.3861956
The coefficients for the predictor (Tarsus) and the intercept are approximately the same across the three fits. We can interpret these results as mean Culmen length increasing by ~0.37 mm for each 1 mm increase in Tarsus length. The biologically arbitrary Tarsus length of 0 mm is associated with a mean Culmen length of approximately -0.10 mm.
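As a quick check of that interpretation, plugging the reported least-squares coefficients into the regression equation gives the expected Culmen length at a chosen Tarsus length (30 mm here, an arbitrary illustrative value):

```r
# expected Culmen length from the fitted line: intercept + slope * Tarsus
intercept <- -0.09870712
slope     <-  0.37292677
intercept + slope * 30  # ~11.09 mm
```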
The least squares and likelihood models have confidence intervals, which are numerically very close to the Bayesian credible intervals here, but their interpretations differ: a 95% confidence interval comes from a procedure that covers the true parameter in 95% of repeated samples, while a 95% credible interval has a 95% probability of containing the parameter, given the data and the prior.
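A simulation (hypothetical data, with the slope fixed at 0.37) illustrates the frequentist reading: the 95% CI is a procedure that covers the true slope in about 95% of repeated samples.

```r
set.seed(123)
true_slope <- 0.37
covered <- replicate(1000, {
  x <- runif(50, 16, 50)
  y <- -0.1 + true_slope * x + rnorm(50, sd = 1)
  ci <- confint(lm(y ~ x))["x", ]          # 95% CI for the slope
  ci[1] <= true_slope && true_slope <= ci[2]
})
mean(covered)  # close to 0.95
```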
5c) Everyday I’m Profilin’ (10 points) For your likelihood fit, are your profiles well behaved? For just the slope, use grid sampling to create a profile. You’ll need to write functions for this, sampling the whole grid of slope and intercept, and then take out the relevant slices as we have done before. Use the results from the fit above to provide the reasonable bounds of what you should be profiling over (3SE should do). Is it well behaved? Plot the profile and give the 80% and 95% CI (remember how we use the chisq here!). Verify your results with profileModel.
#log-likelihood of the observations under a normal with the given mean and sd
norm_likelihood <- function(obs, mean_est, sd_est){
  sum(dnorm(obs, mean = mean_est, sd = sd_est, log = TRUE))
}
quail_dist <- crossing(m = seq(30, 40, by = 0.1),
s=seq(5, 15, by = 0.1)) %>%
rowwise() %>%
mutate(log_lik = norm_likelihood(obs = na.omit(quail_data$Tarsus..mm.), mean_est = m, sd_est = s)) %>%
ungroup()
#profile for the standard deviation: for each s, keep the maximum log-likelihood over m
quail_sd_profile <- quail_dist %>%
#group by sd
group_by(s) %>%
#filter out mle
filter(log_lik == max(log_lik)) %>%
ungroup()
qplot(s, log_lik, data=quail_sd_profile, geom = "line")
ninetyfive_CI <- quail_sd_profile %>%
filter(log_lik > max(log_lik) - qchisq(0.95, df = 1)/2) %>%
filter(row_number()==1 | row_number()==n())
ninetyfive_CI
## # A tibble: 2 x 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 31.7 6.5 -2555.
## 2 31.7 7 -2555.
eighty_CI <- quail_sd_profile %>%
filter(log_lik > max(log_lik) - qchisq(0.80, df = 1)/2) %>%
filter(row_number()==1 | row_number()==n())
eighty_CI
## # A tibble: 2 x 3
## m s log_lik
## <dbl> <dbl> <dbl>
## 1 31.7 6.6 -2555.
## 2 31.7 6.9 -2555.
#re-plot the sd profile with the CI bounds marked
#(red = 80% CI bounds, blue = 95% CI bounds)
qplot(s, log_lik, data = quail_sd_profile, geom = "line") +
  geom_vline(xintercept = 6.6, color = "red") +
  geom_vline(xintercept = 6.9, color = "red") +
  geom_vline(xintercept = 6.5, color = "blue") +
  geom_vline(xintercept = 7, color = "blue") +
  theme_minimal()
#verify 95% CI results
prof <- profileModel(quail_mle,
objective = "ordinaryDeviance",
quantile = qchisq(0.95,1))
## Preliminary iteration .. Done
##
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter Tarsus..mm. ... Done
plot(prof)
#verify 80% CI results
prof80 <- profileModel(quail_mle,
objective = "ordinaryDeviance",
quantile = qchisq(0.80,1))
## Preliminary iteration .. Done
##
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter Tarsus..mm. ... Done
plot(prof80)
5d) The Power of the Prior (10 points) This data set is pretty big. After excluding NAs in the variables we’re interested in, it’s over 766 lines of data! Now, a lot of data can overwhelm a strong prior. But only to a point. Show first that there is enough data here that a prior for the slope with an estimate of 0.7 and a sd of 0.01 is overwhelmed and produces similar results to the default prior. How different are the results from the original?
adj_brm_prior <- brm(Culmen..mm. ~ Tarsus..mm.,
data = quail_data,
family=gaussian(),
prior = (prior(normal(0.7, 0.01), class = b)))
adj_brm_prior
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: Culmen..mm. ~ Tarsus..mm.
## Data: quail_data (Number of observations: 766)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept -4.25 0.26 -4.78 -3.76 1.00
## Tarsus..mm. 0.50 0.01 0.49 0.52 1.00
## Bulk_ESS Tail_ESS
## Intercept 2545 2433
## Tarsus..mm. 2537 2498
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sigma 1.52 0.05 1.43 1.62 1.00 2311
## Tail_ESS
## sigma 2443
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
#Prepare coefficients and credible intervals from the BRMs for graphing
#adjusted prior coefficient and credible interval
adjPriorCoef <- fixef(adj_brm_prior)
adjCoef_wide <- spread_draws(adj_brm_prior,
b_Intercept,
b_Tarsus..mm.)
#default prior coefficient and credible interval
quail_coefs <- fixef(quail_brm)
quail_coefs_wide <- spread_draws(quail_brm,
b_Intercept,
b_Tarsus..mm.)
#tarsus v. culmen plot
quail_plot +
#credible interval of default prior
geom_abline(data = quail_coefs_wide,
aes(slope = b_Tarsus..mm.,
intercept = b_Intercept),
color = "blue",
alpha = 0.06) +
#median fit
geom_abline(slope = quail_coefs[2,1],
intercept = quail_coefs[1,1],
color = "red",
size = 1)+
# credible interval of adjusted prior
geom_abline(data = adjCoef_wide,
aes(slope = b_Tarsus..mm.,
intercept = b_Intercept),
color = "orange",
alpha = 0.08) +
#median fit of adjusted prior model
geom_abline(slope = adjPriorCoef[2,1],
intercept = adjPriorCoef[1,1],
color = "green",
size = 1)
Second, randomly sample 10, 100, 300, and 500 data points. At which level is our prior overwhelmed (e.g., the prior slope becomes highly unlikely)? Communicate that visually in the best way you feel gets the point across, and explain your reasoning.
#draw random samples from quail data
Functionally <- function(x) {
  #drop NAs first so the model sees exactly x complete rows
  samples <- quail_data %>%
    na.omit() %>%
    slice_sample(n = x)
  #fit with brms using the default prior
  Brm_object <- brm(Culmen..mm. ~ Tarsus..mm.,
                    data = samples,
                    family = gaussian(link = "identity"),
                    chains = 2,
                    iter = 500)
  plot(Brm_object)
  return(Brm_object)
}
tensamps <- Functionally(10)
hundredsamps <- Functionally(100)
threehunsamps <- Functionally(300)
fivehunsamps <- Functionally(500)
#ten sample coefs and draws
tensampCoef <- fixef(tensamps)
tensampcoefs_wide <- spread_draws(tensamps,
b_Intercept,
b_Tarsus..mm.)
#get posterior
#100 sample coefs and draws
hundredsampCoef <- fixef(hundredsamps)
hundredcoefs_wide <- spread_draws(hundredsamps,
b_Intercept,
b_Tarsus..mm.)
#500 sample coefs and draws
fivesampCoef <- fixef(fivehunsamps)
fivesampcoefs_wide <- spread_draws(fivehunsamps,
b_Intercept,
b_Tarsus..mm.)
quail_plot +
#simulated draws from ten samps
geom_abline(data = tensampcoefs_wide,
aes(slope = b_Tarsus..mm.,
intercept = b_Intercept),
color = "orange",
alpha = 0.26) +
#median fit of ten samples
geom_abline(slope = tensampCoef[2,1],
intercept = tensampCoef[1,1],
color = "yellow",
size = 1)+
#simulated draws of 100 samples
geom_abline(data = hundredcoefs_wide,
aes(slope = b_Tarsus..mm.,
intercept = b_Intercept),
color = "purple",
alpha = 0.16) +
#median fit of 100 samples
geom_abline(slope = hundredsampCoef[2,1],
intercept = hundredsampCoef[1,1],
color = "white",
size = 1)+
# credible interval of 500 samps
geom_abline(data = fivesampcoefs_wide,
aes(slope = b_Tarsus..mm.,
intercept = b_Intercept),
color = "blue",
alpha = 0.08) +
#median fit of 500 samps
geom_abline(slope = fivesampCoef[2,1],
intercept = fivesampCoef[1,1],
color = "green",
size = 1)
The weakly informative default prior yields low certainty in the parameter estimates when the sample size is ten. Certainty increases as the posterior is updated with more data, and by around 300 samples the estimate interval begins to resemble that of the full data set. Judging by the credible bands, a model fit to 100 samples should already generalize to new data fairly well.
+4 for a function that means you don’t have to copy and paste the model over and over. + 4 more if you use map() in combination with a tibble to make this as code-efficient as possible. This will also make visualization easier.
The culmen-tarsus relationship is cubic
Brm_object1 <- brm(Culmen..mm. ~ poly(Tarsus..mm.,3),
data = na.omit(quail_data),
family = gaussian(link = "identity"))
color_scheme_set("green")
pp_check(Brm_object1, "dens_overlay")+
theme_light()
This is a much better fit than the linear model from before.
quail_waic <- waic(quail_brm)
quail_waic
##
## Computed from 4000 by 766 log-likelihood matrix
##
## Estimate SE
## elpd_waic -1252.7 21.9
## p_waic 3.1 0.3
## waic 2505.4 43.8
quail_loo <- loo(quail_brm)
quail_loo
##
## Computed from 4000 by 766 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1252.7 21.9
## p_loo 3.1 0.3
## looic 2505.4 43.8
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
poly_loo <- loo(Brm_object1)
compare(quail_loo, poly_loo)
## elpd_diff se
## 11.9 5.3
The positive elpd_diff (11.9, roughly two standard errors above zero) indicates that the cubic model has better expected out-of-sample predictive accuracy than the linear model.
Citation: https://link.springer.com/article/10.1023/B:BREA.0000014042.54925.cc (source for the claim that two-thirds of in vitro breast cancer research uses three BCC lines)